AITopics | reward and cost

Constrained Best Arm Identification

Neural Information Processing Systems | Jun-19-2026, 19:45:33 GMT

In real-world decision-making problems, one needs to pick among multiple policies the one that performs best while respecting economic constraints. This motivates the problem of constrained best-arm identification for bandit problems where every arm is a joint distribution of reward and cost. We investigate the general case where reward and cost are dependent. The goal is to accurately identify the arm with the highest mean reward among all arms whose mean cost is below a given threshold. We prove information-theoretic lower bounds on the sample complexity for three models: Gaussian with fixed covariance, Gaussian with unknown covariance, and non-parametric distributions of rectangular support. We propose a combination of a sampling and a stopping rule that correctly identifies the constrained best arm and matches the optimal sample complexities for each of the three models. Simulations demonstrate the performance of our algorithms.
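To make the identification target concrete, here is a minimal sketch of the selection criterion only: the arm with the highest mean reward among arms whose mean cost does not exceed a threshold. It assumes the means are given directly. The function name `constrained_best_arm`, the example numbers, and the non-strict inequality are illustrative assumptions; this is not the paper's sampling or stopping rule.

```python
from typing import Optional, Sequence


def constrained_best_arm(
    mean_rewards: Sequence[float],
    mean_costs: Sequence[float],
    threshold: float,
) -> Optional[int]:
    """Return the index of the feasible arm with the highest mean reward.

    An arm is treated as feasible when its mean cost is at most `threshold`
    (an assumption; the abstract only says "below a given threshold").
    Returns None when no arm is feasible.
    """
    if len(mean_rewards) != len(mean_costs):
        raise ValueError("mean_rewards and mean_costs must have equal length")
    # Keep only the arms that satisfy the cost constraint.
    feasible = [i for i, cost in enumerate(mean_costs) if cost <= threshold]
    if not feasible:
        return None
    # Among feasible arms, pick the one with the largest mean reward.
    return max(feasible, key=lambda i: mean_rewards[i])


# Hypothetical example: arm 2 has the highest reward but violates the constraint.
rewards = [0.5, 0.7, 0.9]
costs = [0.2, 0.4, 0.8]
best = constrained_best_arm(rewards, costs, threshold=0.5)
```

In a bandit setting the means are unknown and must be estimated from samples, which is where the sample-complexity question studied in the paper arises.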

constraint, data mining, machine learning, (21 more...)

Genre: Research Report > Experimental Study (1.00)

Technology:

Information Technology > Artificial Intelligence > Machine Learning (1.00)
Information Technology > Artificial Intelligence > Representation & Reasoning (0.93)
Information Technology > Data Science > Data Mining > Big Data (0.48)

e75a1c8af8d9438df1057fdaa42913eb-Paper-Conference.pdf

Neural Information Processing Systems | Feb-12-2026, 13:27:25 GMT

adaptive policy, bandit, constraint, (16 more...)

Country:

Europe > United Kingdom > England > Cambridgeshire > Cambridge (0.04)
Europe > France > Île-de-France > Paris > Paris (0.04)

Genre: Research Report (0.47)

Technology:

Information Technology > Artificial Intelligence > Representation & Reasoning (0.94)
Information Technology > Data Science (0.68)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning (0.46)

Contextual Bandits with Knapsacks for a Conversion Model

Neural Information Processing Systems | Dec-25-2025, 14:26:08 GMT

We consider contextual bandits with knapsacks, with an underlying structure between rewards generated and cost vectors suffered. We do so motivated by sales with commercial discounts. At each round, given the stochastic i.i.d. context $\mathbf{x}_t$ and the arm picked $a_t$ (corresponding, e.g., to a discount level), a customer conversion may be obtained, in which case a reward $r(a,\mathbf{x}_t)$ is gained and vector costs $\mathbf{c}(a_t,\mathbf{x}_t)$ are suffered (corresponding, e.g., to losses of earnings). Otherwise, in the absence of a conversion, the reward and costs are null. The reward and costs achieved are thus coupled through the binary variable measuring conversion or the absence thereof. This underlying structure between rewards and costs is different from the linear structures considered by Agrawal and Devanur [2016] (but we show that the techniques introduced in the present article may also be applied to the case of these linear structures). The adaptive policies exhibited in this article solve at each round a linear program based on upper-confidence estimates of the probabilities of conversion given $a$ and $\mathbf{x}$. This kind of policy is most natural and achieves a regret bound of the typical order $(\mathrm{OPT}/B) \sqrt{T}$, where $B$ is the total budget allowed, $\mathrm{OPT}$ is the optimal expected reward achievable by a static policy, and $T$ is the number of rounds.
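The coupling of reward and costs through the conversion variable can be sketched as a single simulated round. This is an illustrative toy, not the paper's policy: the function name `simulate_round`, the Bernoulli draw via `random.Random`, and the example values are assumptions, and the linear program over upper-confidence estimates is not implemented here.

```python
import random
from typing import List, Tuple


def simulate_round(
    p_conversion: float,
    reward: float,
    costs: List[float],
    rng: random.Random,
) -> Tuple[float, List[float]]:
    """Draw one conversion and return the realized (reward, cost vector).

    With probability `p_conversion` the customer converts, and both the
    reward and the full cost vector are incurred. Otherwise both are null.
    """
    converted = rng.random() < p_conversion
    if converted:
        return reward, list(costs)
    return 0.0, [0.0] * len(costs)


# Hypothetical values: a discount level with reward 10 and two cost components.
rng = random.Random(0)
always = simulate_round(1.0, 10.0, [2.0, 1.0], rng)
never = simulate_round(0.0, 10.0, [2.0, 1.0], rng)
```

Because a single binary draw gates both quantities, the realized reward and costs are either all present or all zero in any given round.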

contextual bandit, mathbf, name change, (6 more...)

Technology: Information Technology > Artificial Intelligence > Machine Learning (0.39)

5e88ccc6d03f08e426de9bb918aa1bca-Paper-Conference.pdf

Neural Information Processing Systems | Nov-18-2025, 23:42:12 GMT

artificial intelligence, machine learning, reinforcement learning, (13 more...)

Country:

North America > United States > Washington (0.04)
North America > United States > New Jersey (0.04)

Genre: Research Report > Experimental Study (0.93)

Industry: Information Technology (0.46)

Technology:

Information Technology > Artificial Intelligence > Robots (1.00)
Information Technology > Artificial Intelligence > Representation & Reasoning (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Reinforcement Learning (0.95)

Adversarially Trained Weighted Actor-Critic for Safe Offline Reinforcement Learning

Neural Information Processing Systems | Oct-11-2025, 00:23:27 GMT

Additionally, we offer a practical version of WSAC and compare it with existing state-of-the-art safe offline RL algorithms in several continuous control environments.

algorithm, assumption, behavior policy, (12 more...)

Country:

North America > United States > Washington (0.04)
North America > United States > New Jersey (0.04)

Genre: Research Report > Experimental Study (0.93)

Industry: Information Technology (0.46)

Technology: Information Technology > Artificial Intelligence > Machine Learning > Reinforcement Learning (1.00)

A New Benchmark for Online Learning with Budget-Balancing Constraints

Braverman, Mark, Liu, Jingyi, Mao, Jieming, Schneider, Jon, Xue, Eric

arXiv.org Artificial Intelligence | Mar-18-2025

The adversarial Bandit with Knapsack problem is a multi-armed bandit problem with budget constraints and adversarial rewards and costs. In each round, a learner selects an action to take and observes the reward and cost of the selected action. The goal is to maximize the sum of rewards while satisfying the budget constraint. The classical benchmark to compare against is the best fixed distribution over actions that satisfies the budget constraint in expectation. Unlike its stochastic counterpart, where rewards and costs are drawn from some fixed distribution (Badanidiyuru et al., 2018), the adversarial BwK problem does not admit a no-regret algorithm for every problem instance due to the "spend-or-save" dilemma (Immorlica et al., 2022). A key problem left open by existing works is whether there exists a weaker but still meaningful benchmark to compare against such that no-regret learning is still possible. In this work, we present a new benchmark to compare against, motivated both by real-world applications such as autobidding and by its underlying mathematical structure. The benchmark is based on the Earth Mover's Distance (EMD), and we show that sublinear regret is attainable against any strategy whose spending pattern is within EMD $o(T^2)$ of any sub-pacing spending pattern. As a special case, we obtain results against the "pacing over windows" benchmark, where we partition time into disjoint windows of size $w$ and allow the benchmark strategies to choose a different distribution over actions for each window while satisfying a pacing budget constraint. Against this benchmark, our algorithm obtains a regret bound of $\tilde{O}(T/\sqrt{w}+\sqrt{wT})$. We also show a matching lower bound, proving the optimality of our algorithm in this important special case. In addition, we provide further evidence of the necessity of the EMD condition for obtaining sublinear regret.
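As a quick arithmetic illustration of the stated bound, the sketch below evaluates the expression $T/\sqrt{w}+\sqrt{wT}$ for a few window sizes, ignoring constants and logarithmic factors. The function name `window_regret_bound` and the chosen values of $T$ and $w$ are assumptions for illustration only. The two terms are equal when $w=\sqrt{T}$, where the expression evaluates to $2T^{3/4}$; this is a property of the expression itself, not a claim taken from the paper.

```python
import math


def window_regret_bound(T: int, w: int) -> float:
    """Evaluate T / sqrt(w) + sqrt(w * T), dropping constants and log factors."""
    if T <= 0 or w <= 0:
        raise ValueError("T and w must be positive")
    return T / math.sqrt(w) + math.sqrt(w * T)


# Hypothetical horizon; w = 100 equals sqrt(T) here.
T = 10_000
bounds = {w: window_regret_bound(T, w) for w in (1, 100, 10_000)}
```

For very small windows the first term dominates, and for windows close to the full horizon the second term dominates, so the expression is linear in $T$ at both extremes.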

artificial intelligence, data mining, machine learning, (17 more...)

2503.14796

Country:

South America > Chile > Santiago Metropolitan Region > Santiago Province > Santiago (0.04)
North America > United States > Massachusetts > Middlesex County > Cambridge (0.04)
North America > United States > Florida > Broward County > Fort Lauderdale (0.04)

Genre: Research Report (0.50)

Industry: Education > Educational Setting > Online (0.41)

Technology:

Information Technology > Data Science > Data Mining > Big Data (1.00)
Information Technology > Artificial Intelligence > Machine Learning (1.00)
Information Technology > Artificial Intelligence > Representation & Reasoning (0.93)

Bayesian Graph Traversal

Caballero, William N., Jenkins, Phillip R., Banks, David, Robbins, Matthew

arXiv.org Artificial Intelligence | Mar-7-2025

This research considers Bayesian decision-analytic approaches toward the traversal of an uncertain graph. Namely, a traveler progresses over a graph in which rewards are gained upon a node's first visit and costs are incurred for every edge traversal. The traveler knows the graph's adjacency matrix and his starting position but does not know the rewards and costs. The traveler is a Bayesian who encodes his beliefs about these values using a Gaussian process prior and who seeks to maximize his expected utility over these beliefs. Adopting a decision-analytic perspective, we develop sequential decision-making solution strategies for this coupled information-collection and network-routing problem. We show that the problem is NP-Hard and derive properties of the optimal walk. These properties provide heuristics for the traveler's problem that balance exploration and exploitation. We provide a practical case study focused on the use of unmanned aerial systems for public safety and empirically study policy performance in myriad Erdős-Rényi settings.
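The reward and cost accounting described in the abstract can be sketched for a fixed walk with known values. This is only the bookkeeping, under stated assumptions: the function name `walk_utility`, the dictionary representation, the choice to count the starting node's reward, and the example numbers are all hypothetical. In the paper the rewards and costs are unknown and modeled with a Gaussian process prior, which this sketch does not attempt.

```python
from typing import Dict, Hashable, Sequence, Tuple


def walk_utility(
    walk: Sequence[Hashable],
    node_rewards: Dict[Hashable, float],
    edge_costs: Dict[Tuple[Hashable, Hashable], float],
) -> float:
    """Sum rewards on first visits minus costs of every edge traversal."""
    visited = set()
    total = 0.0
    for step, node in enumerate(walk):
        # A node's reward is collected only the first time it is visited.
        if node not in visited:
            visited.add(node)
            total += node_rewards[node]
        # Every traversal of an edge incurs its cost, including repeats.
        if step > 0:
            total -= edge_costs[(walk[step - 1], node)]
    return total


# Hypothetical three-node example in which node "a" is revisited.
rewards = {"a": 1.0, "b": 2.0, "c": 3.0}
costs = {("a", "b"): 0.5, ("b", "a"): 0.5, ("a", "c"): 1.0}
utility = walk_utility(["a", "b", "a", "c"], rewards, costs)
```

Revisiting "a" adds no reward but still pays the edge cost, which is the tension between exploring new nodes and the cost of moving that the abstract describes.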

clairvoyant, node, traveler, (17 more...)

2503.05963

Country:

Europe > United Kingdom > England > Oxfordshire > Oxford (0.14)
Europe > United Kingdom > England > Cambridgeshire > Cambridge (0.14)
North America > United States > North Carolina > Durham County > Durham (0.04)
(3 more...)

Genre: Research Report > New Finding (0.46)

Industry:

Law Enforcement & Public Safety > Crime Prevention & Enforcement (1.00)
Government > Regional Government > North America Government > United States Government (1.00)
Government > Military (1.00)
Transportation (0.89)

Technology:

Information Technology > Artificial Intelligence > Robots > Autonomous Vehicles > Drones (0.48)
Information Technology > Artificial Intelligence > Representation & Reasoning > Optimization (0.46)
Information Technology > Artificial Intelligence > Representation & Reasoning > Uncertainty (0.46)
Information Technology > Artificial Intelligence > Machine Learning > Reinforcement Learning (0.34)

Contextual Bandits with Knapsacks for a Conversion Model

Neural Information Processing Systems | Jan-19-2025, 04:29:54 GMT

We consider contextual bandits with knapsacks, with an underlying structure between rewards generated and cost vectors suffered. We do so motivated by sales with commercial discounts. At each round, given the stochastic i.i.d. context $\mathbf{x}_t$ and the arm picked $a_t$ (corresponding, e.g., to a discount level), a customer conversion may be obtained, in which case a reward $r(a,\mathbf{x}_t)$ is gained and vector costs $\mathbf{c}(a_t,\mathbf{x}_t)$ are suffered (corresponding, e.g., to losses of earnings). Otherwise, in the absence of a conversion, the reward and costs are null. The reward and costs achieved are thus coupled through the binary variable measuring conversion or the absence thereof.

contextual bandit, conversion model, mathbf, (4 more...)

Technology: Information Technology > Artificial Intelligence (0.39)
